Some sort of introductory text to what cleaning data even means.
1. Setup
First, we need to load the packages that are going to help facilitate the data analysis. You don’t need to worry about what each package does, but this is an important setup step that allows us to use specific functions later on. I always include tidyverse, janitor and lubridate at the top of every R file.
If you want to learn more about what each of those packages does,
Tip
Add a “code chunk” to your file by hitting Command+Option+i on the keyboard. You’ll run all of your code within these chunks, so get used to this keyboard shortcut!
Installing the packages
Loading them in the R file
Copy and paste the code below and “run” it by hitting the green play button in the top right of your code chunk.
# setup library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(lubridate)
Tip
You’ll see some sort of message pop up after you run the code; as long as there’s no red “error” warning, you’re fine to move on.
2. Importing the data from LegiScan
Now you need to take the data that we downloaded to your computer and import it into this R file. It sounds complicated, but all it takes is a simple line of code.
2.1 Create your data folders
If you open the files we downloaded from LegiScan in the previous step, you’ll see it actually includes seven different data tables. For the purposes of this project, we’re only going to be working with three of those: “bills,” “sponsors,” and “people.”
Create a new code chunk using the keyboard shortcut from above. Copy and run the code below to import those three tables into this file.
Rows: 9897 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): bill_number, status_desc, title, description, committee, last_acti...
dbl (4): bill_id, session_id, status, committee_id
date (2): status_date, last_action_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 14190 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): bill_id, people_id, position
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 181 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, first_name, middle_name, last_name, suffix, nickname, party,...
dbl (7): people_id, party_id, role_id, followthemoney_eid, votesmart_id, kn...
lgl (1): opensecrets_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
lawmakers
3. “Cleaning” those tables
Oftentimes, data tables aren’t in the best shape when you download them. There might be rows or cells that are missing entries, or there may be columns that aren’t necessary for the analysis you want to do. This is where cleaning comes in: you can use code to adjust the table/remove data you don’t need so the final table is much easier to work with.
3.1 Bills
Let’s clean up each of the data tables one at a time. We’ll start with our “bills” data.
A) View the raw data
Before we start cleaning up the tables, let’s take another look at what they look like untouched. Run this code to view the table you just imported.
Now that we’ve selected the columns we want to work with in each of the three data tables, we need to merge everything together into one big table that we’ll use for our analysis. You’ll notice there are a couple of repeat columns in the tables, such as people_id and bill_id. We’ll want to make sure those only appear once in our big table. Let’s do this in two steps: first, copy the code below to merge the lawmaker and sponsor tables together.
Great! You should have one big data table with the following information:
people_id = the unique number associated with each lawmaker
name = lawmaker’s first and last name
mid_init = lawmaker’s middle initial
suffix = lawmaker’s suffix, if applicable
party = political party of each lawmaker
role = whether the lawmaker is a senator or a representative
district = the congressional district associated with each lawmaker
bill_id = the unique number associated with a specific bill filed this session
bill_number = standard bill naming in the legislature
spons_position = for a specific bill, this number indicates where a lawmaker is in the line of sponsor (ex: “1”=primary sponsor, “7”=seventh sponsor, etc.)
status = a numerical representation of where the bill currently sits (ex: “0”=no progress, “1”=Introduced, “2”=Engrossed, “4”=Passed) There is no 3, I don’t know why.
status_desc = written description of the status
status_date = date a specific bill achieved the corresponding status
title = shorter description of bill as written in LegiScan
description = longer description of bill as written in LegiScan
committee = which Senate/House committee the bill was referred to
committee_id = unique code associated with each committee
last_action = most recent update to bill as of the date you downloaded the data
last_action_date = calendar date of that update
This is a lot of information to work with! We may not end up using every column, but it’s still important to have a full picture of the data available. The table has over 14,000 rows of data, each corresponding with a bill/joint resolution. Now you can see the appeal of sorting through everything in R, rather than manually or in a spreadsheet.
Add callout note about companion bills
We’ll be able to quickly sort through all of this information and identify key details about the agendas of Texas lawmakers in the current legislative session. Let’s save this table before we move forward with our analysis.
5. Saving the clean table
Add instructions for creating a data-processed file.